Project: Spotify Music Recommendation System

Overview

In this project, we use a Kaggle dataset to build a recommendation system that takes a track as input and suggests a number of similar tracks.

It is based on a K-Nearest Neighbors algorithm, and the project is written in Scala.

This notebook can also be found on GitHub.

Steps:

  • Data import

  • Data preprocessing
    In this part we remove malformed rows caused by track names that are too long, which shift the columns of our DataFrame. We also notice duplicate tracks that will have to be handled later. These duplicates exist because a track can belong to several genres: such tracks then have several rows and, even though their IDs stay the same, their measurements can differ.

  • Study of the metrics
    This part explains the meaning of each metric available in the dataset. For each metric, an accompanying chart shows how its values are distributed. The metrics are: genre, artist_name, track_name, track_id, popularity, acousticness, danceability, duration_ms, energy, instrumentalness, key, liveness, loudness, mode, speechiness, tempo, time_signature, valence.

  • Data typing
    Casting the dataset columns according to their values.

  • Feature correlation study
    Study of the correlation matrix of our features, to make sure none of them are too strongly correlated.

  • Feature selection
    We decide not to use duration_ms or popularity in the rest of the model, since these values do not seem relevant for a recommendation system.

  • Data preparation
    In this part we transform the fields of our data so that they can be used by our model.

    • Encoding our text variables with StringIndexer
      We transform four text features, genre, key, mode and time_signature, to obtain numeric inputs.

    • Using a One-Hot Encoder
      On the same features as the StringIndexer: to keep our model from assuming a hierarchy between categories, we pass them through a OneHotEncoder, which turns them into vectors of 0s and 1s.

    • Aggregating our data
      We have to solve the problem of duplicated tracks in our dataset. Since the genre of each track matters to us, we need to remove the duplicates while keeping that information. To do so, we aggregate the duplicated rows by taking the mean of each numeric measurement. For the vectors created by the OneHotEncoder, we use the max function to make sure we don't lose the information about the different genres.

    • Vector Assembler
      To gather all our features into a single column used as the model's input.

    • Normalizer
      So that each feature carries the same weight, we normalize our data.

  • Nearest neighbors algorithm
    We use Spark's MLlib library, more specifically the BucketedRandomProjectionLSH class, an approximation of the nearest neighbors method designed for large-scale data.

  • Results

  • Using the trained model
    Definition of a function that reuses the trained model, taking a track_id as input.

Limitations of the study:

There is no way to measure the model's accuracy: each user has to listen to the suggested tracks to judge whether the model works. Personally, I find the results relevant.

Possible improvement:

Write a function for the aggregation step of tracks that span multiple rows, so that the whole data preparation can be compacted into a single pipeline.
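A minimal sketch of what such a pipeline stage could look like, assuming a hypothetical custom Transformer named TrackAggregator that wraps the groupBy/agg step shown further down (only two aggregations are spelled out here):

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

// Hypothetical: a Transformer wrapping the deduplication/aggregation step so it
// can be chained after the encoders inside a single Pipeline.
class TrackAggregator(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("trackAggregator"))

  override def transform(df: Dataset[_]): DataFrame =
    df.groupBy("track_id", "artist_name", "track_name")
      .agg(mean("acousticness").alias("acousticness"),
           // ... the remaining mean() aggregations, as in the aggregation cell below ...
           Summarizer.max(col("genreVec")).alias("genreVec"))

  override def copy(extra: ParamMap): TrackAggregator = defaultCopy(extra)

  // Simplification: a complete implementation would derive the aggregated schema.
  override def transformSchema(schema: StructType): StructType = schema
}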

Import

In [1]:
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions._
In [2]:
%AddDeps org.vegas-viz vegas_2.11 0.3.10 --transitive
Marking org.vegas-viz:vegas_2.11:0.3.10 for download
Obtained 42 files
In [3]:
%AddDeps org.vegas-viz vegas-spark_2.11 0.3.10 --transitive
Marking org.vegas-viz:vegas-spark_2.11:0.3.10 for download
Obtained 44 files
In [4]:
implicit val render = vegas.render.ShowHTML(kernel.display.content("text/html", _))
render = <function1>
Out[4]:
<function1>
In [5]:
import vegas._
import vegas.render.WindowRenderer._
import vegas.sparkExt._

Data Import

In [11]:
val df: DataFrame = spark
      .read
      .option("header", true) // use the first line of the file(s) as the header
      .option("inferSchema", "true") // infer the type of each column (Int, String, etc.)
      .csv("./data/SpotifyFeatures.csv")
// df.printSchema()
println(s"Nombre de lignes : ${df.count}")
println(s"Nombre de colonnes : ${df.columns.length}")
Nombre de lignes : 232725
Nombre de colonnes : 18
In [12]:
%%dataframe
df
Out[12]:
genre | artist_name | track_name | track_id | popularity | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence
Movie | Henri Salvador | C'est beau de faire un Show | 0BRjO6ga9RKCKjfDqeFgWV | 0 | 0.611 | 0.389 | 99373 | 0.91 | 0 | C# | 0.346 | -1.828 | Major | 0.0525 | 166.969 | 4/4 | 0.814
Movie | Martin & les fées | Perdu d'avance (par Gad Elmaleh) | 0BjC1NfoEOOusryehmNudP | 1 | 0.246 | 0.59 | 137373 | 0.737 | 0 | F# | 0.151 | -5.559 | Minor | 0.0868 | 174.003 | 4/4 | 0.816
Movie | Joseph Williams | Don't Let Me Be Lonely Tonight | 0CoSDzoNIKCRs124s9uTVy | 3 | 0.952 | 0.663 | 170267 | 0.131 | 0 | C | 0.103 | -13.879 | Minor | 0.0362 | 99.488 | 5/4 | 0.368
Movie | Henri Salvador | Dis-moi Monsieur Gordon Cooper | 0Gc6TVm52BwZD07Ki6tIvf | 0 | 0.703 | 0.24 | 152427 | 0.326 | 0 | C# | 0.0985 | -12.178 | Major | 0.0395 | 171.758 | 4/4 | 0.227
Movie | Fabien Nataf | Ouverture | 0IuslXpMROHdEPvSl1fTQK | 4 | 0.95 | 0.331 | 82625 | 0.225 | 0.123 | F | 0.202 | -21.15 | Major | 0.0456 | 140.576 | 4/4 | 0.39
Movie | Henri Salvador | Le petit souper aux chandelles | 0Mf1jKa8eNAf1a4PwTbizj | 0 | 0.749 | 0.578 | 160627 | 0.0948 | 0 | C# | 0.107 | -14.97 | Major | 0.143 | 87.479 | 4/4 | 0.358
Movie | Martin & les fées | Premières recherches (par Paul Ventimila, Lorie Pester, Véronique Jannot, Michèle Laroque & Gérard Lenorman) | 0NUiKYRd6jt1LKMYGkUdnZ | 2 | 0.344 | 0.703 | 212293 | 0.27 | 0 | C# | 0.105 | -12.675 | Major | 0.953 | 82.873 | 4/4 | 0.533
Movie | Laura Mayne | Let Me Let Go | 0PbIF9YVD505GutwotpB5C | 15 | 0.939 | 0.416 | 240067 | 0.269 | 0 | F# | 0.113 | -8.949 | Major | 0.0286 | 96.827 | 4/4 | 0.274
Movie | Chorus | Helka | 0ST6uPfvaPpJLtQwhE6KfC | 0 | 0.00104 | 0.734 | 226200 | 0.481 | 0.00086 | C | 0.0765 | -7.725 | Major | 0.046 | 125.08 | 4/4 | 0.765
Movie | Le Club des Juniors | Les bisous des bisounours | 0VSqZ3KStsjcfERGdcWpFO | 10 | 0.319 | 0.598 | 152694 | 0.705 | 0.00125 | G | 0.349 | -7.79 | Major | 0.0281 | 137.496 | 4/4 | 0.718

Data preprocessing

Removing malformed rows

The dataset is not perfect: some rows appear to be shifted when the track name is too long.

In [13]:
%%dataframe --limit 5
df.filter(! col("key").isInCollection(Array("C", "G", "D", "C#", "A", "F", "B", "E", "A#", "F#", "G#", "D#")))
Out[13]:
genre | artist_name | track_name | track_id | popularity | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence
Anime | Nanahira | """Oshietekudasai goshujinsama""" | 4xbsWTKyvlrOztwLqMreEc | 13 | 0.115 | 0.561 | 229295 | 0.901 | 0 | C# | 0.0952 | -2.016 | Minor | 0.0613 | 183.982 | 4/4 |
Blues | Queens of the Stone Age | """You Got A Killer Scene There Man...""" | 6ZZiYOTFuZC1XLJjMiEnvS | 32 | 0.00688 | 0.55 | 296627 | 0.617 | 0.112 | B | 0.238 | -5.662 | Minor | 0.0304 | 101.214 | 4/4 |
Movie | Sacha Tran | "Quinze ans à peine - Extrait de ""Robin des Bois Le Spectacle"" [Live 2014]" | 1wsJmPdJTAOFoFnVphXT19 | 10 | 0.19 | 0.46 | 228160 | 0.252 | 0 | F# | 0.712 | -9.375 | Major | 0.0256 | 79.011 | 4/4 |
Movie | Bruce Broughton | "3 Incongruities, ""Triptych"": No. 3. Rhythmically with a bounce" | 6TlSPscwN6ZoNnsjMvceQm | 0 | 0.972 | 0.349 | 689600 | 0.248 | 0.0965 | D | 0.0668 | -12.303 | Major | 0.0565 | 168.103 | 4/4 |
Movie | Bruce Broughton | "3 Incongruities, ""Triptych"": No. 2. Slow in a singing style" | 4Lg8aGdZCjB8T7P66IHmPe | 0 | 0.948 | 0.341 | 602493 | 0.197 | 0.0334 | F# | 0.136 | -13.894 | Minor | 0.0403 | 119.071 | 4/4 |
In [14]:
val data : DataFrame = df
    .filter($"key".isin("C", "G", "D", "C#", "A", "F", "B", "E", "A#", "F#", "G#", "D#"))
data = [genre: string, artist_name: string ... 16 more fields]
Out[14]:
[genre: string, artist_name: string ... 16 more fields]

Detecting duplicates

In [15]:
%%dataframe --limit 5
data.groupBy("track_id", "artist_name", "track_name").count()
Out[15]:
track_id | artist_name | track_name | count
3AnPOKKZV1NRhED24p9YeX | Chorus | Sai Parameshwar Sai Karuneshwar | 1
64Jyg9AzWl3AHdnkKPmY4T | Adrian Marcel | 2AM. | 4
4cIPBRZVBcsk7yiNYgAnqR | Madison Beer | Fools | 3
1u2ht8xGYJb5Buizx4SanY | Chorus | Chal Chal Chal | 1
6Xz4Pk66OuieSrpau2OdVX | Chorus | Jalwath Karala | 1
In [16]:
%%dataframe
data.filter($"track_name" === "EX - Remix")
Out[16]:
genre | artist_name | track_name | track_id | popularity | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence
Alternative | Kiana Ledé | EX - Remix | 4LfoYkTuIPgJ2RlNkN5P5C | 63 | 0.326 | 0.775 | 196133 | 0.537 | 0 | C | 0.0792 | -6.978 | Major | 0.141 | 73.484 | 4/4 | 0.403
Dance | Kiana Ledé | EX - Remix | 7uToQLjICwg3iPxFiCLBHB | 0 | 0.271 | 0.768 | 196133 | 0.525 | 0 | C | 0.0791 | -7.318 | Major | 0.173 | 73.462 | 4/4 | 0.399
R&B | Kiana Ledé | EX - Remix | 4LfoYkTuIPgJ2RlNkN5P5C | 59 | 0.324 | 0.77 | 196133 | 0.533 | 0 | C | 0.0871 | -7.285 | Major | 0.159 | 73.477 | 4/4 | 0.388
Indie | Kiana Ledé | EX - Remix | 4LfoYkTuIPgJ2RlNkN5P5C | 59 | 0.324 | 0.77 | 196133 | 0.533 | 0 | C | 0.0871 | -7.285 | Major | 0.159 | 73.477 | 4/4 | 0.388
Children’s Music | Kiana Ledé | EX - Remix | 4LfoYkTuIPgJ2RlNkN5P5C | 0 | 0.324 | 0.77 | 196133 | 0.533 | 0 | C | 0.0871 | -7.285 | Major | 0.159 | 73.477 | 4/4 | 0.388
Pop | Kiana Ledé | EX - Remix | 4LfoYkTuIPgJ2RlNkN5P5C | 59 | 0.324 | 0.77 | 196133 | 0.533 | 0 | C | 0.0871 | -7.285 | Major | 0.159 | 73.477 | 4/4 | 0.388

We can see that some songs appear several times in our data. Their metrics differ slightly, but the most important information is the track's genre, which we will want to collect for each song. We will handle that later.
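As a quick check (a sketch using the columns already at hand), we can count how many tracks appear on more than one row:

// Number of distinct track_ids that occur on several rows
val nDuplicated = data.groupBy("track_id").count().filter($"count" > 1).count()
println(s"Tracks with more than one row: $nDuplicated")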

Study of the metrics

The variable descriptions below are based on the Spotify documentation.

Key

The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.

In [17]:
%%dataframe
data.groupBy("key").count().sort($"count".desc)
Out[17]:
key | count
C | 27472
G | 26293
D | 23970
C# | 23096
A | 22593
F | 20140
B | 17627
E | 17324
A# | 15442
F# | 15181
In [18]:
Vegas("Key")
    .withDataFrame(data.groupBy("key").count())
    .encodeX("key", Nom,  scale=Scale(bandSize=50))
    .encodeY("count", Quant)
    .encodeColor(field="count", Quant, scale=Scale(rangeNominals=List("#EA98D2", "#659CCA")))
    .mark(Bar)
    .show

Most of the keys have letter-based values; the exceptions were the malformed rows removed above.

Mode Value

Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

In [19]:
%%dataframe
data.groupBy("mode").count().sort($"count".desc)
Out[19]:
mode | count
Major | 150956
Minor | 80748

The main values are "Major" and "Minor".

In [20]:
Vegas("Mode")
    .withDataFrame(data.groupBy("mode").count())
    .encodeX("mode", Nominal, scale=Scale(bandSize=50))
    .encodeY("count", Quant)
    .encodeColor(field="count", Quant, scale=Scale(rangeNominals=List("#EA98D2", "#659CCA")))
    .mark(Bar)
    .show

Time Signature

An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).

In [21]:
%%dataframe
data.groupBy("time_signature").count().sort($"count".desc)
Out[21]:
time_signature | count
4/4 | 200143
3/4 | 23806
5/4 | 5177
1/4 | 2570
0/4 | 8
In [22]:
Vegas("Time Signature")
    .withDataFrame(data.groupBy("time_signature").count())
    .encodeX("time_signature", Nominal,  scale=Scale(bandSize=50))
    .encodeY("count", Quant)
    .encodeColor(field="count", Quant, scale=Scale(rangeNominals=List("#EA98D2", "#659CCA")))
    .mark(Bar)
    .show

We can see that the vast majority of tracks have a time_signature of 4/4; we can therefore expect this variable not to be very informative.

Acousticness

A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

In [23]:
Vegas("Accousticness")
    .withDataFrame(data.groupBy("acousticness").count())
    .encodeX("acousticness", Quant)
    .encodeY("count", Quant)
    .encodeColor(field="count", Quant, scale=Scale(rangeNominals=List("#EA98D2", "#659CCA")))
    .mark(Bar)
    .show

Danceability

Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

In [24]:
Vegas("Danceability")
    .withDataFrame(data.groupBy("danceability").count())
    .encodeX("danceability", Quant)
    .encodeY("count", Quant)
    .encodeColor(field="count", Quant, scale=Scale(rangeNominals=List("#EA98D2", "#659CCA")))
    .mark(Bar)
    .show

Energy

Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

In [25]:
Vegas("Energy")
    .withDataFrame(data.groupBy("energy").count())
    .encodeX("energy", Quant)
    .encodeY("count", Quant)
    .encodeColor(field="count", Quant, scale=Scale(rangeNominals=List("#EA98D2", "#659CCA")))
    .mark(Bar)
    .show

Instrumentalness

Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

In [26]:
Vegas("Instrumentalness")
    .withDataFrame(data.groupBy("instrumentalness").count())
    .encodeX("instrumentalness", Quant)
    .encodeY("count", Quant)
    .encodeColor(field="count", Quant, scale=Scale(rangeNominals=List("#EA98D2", "#659CCA")))
    .encodeSize(value=11L)
    .mark(Bar)
    .show

The vast majority of tracks therefore contain vocals.

Liveness

Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

In [27]:
Vegas("Liveness")
    .withDataFrame(data.groupBy("liveness").count())
    .encodeX("liveness", Quant)
    .encodeY("count", Quant)
    .encodeColor(field="count", Quant, scale=Scale(rangeNominals=List("#EA98D2", "#659CCA")))
    .encodeSize(value=11L)
    .mark(Bar)
    .show

Loudness

The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.

In [28]:
Vegas("Loudness")
    .withDataFrame(data.groupBy("loudness").count())
    .encodeX("loudness", Quant)
    .encodeY("count", Quant)
    .encodeColor(field="count", Quant, scale=Scale(rangeNominals=List("#EA98D2", "#659CCA")))
    .encodeSize(value=11L)
    .mark(Bar)
    .show

Loudness values above 0 dB can exist, so this does not look like an error in the data.
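As a sketch, we can check how rare such values are (the cast is needed because loudness is still a string at this stage):

// Count rows whose loudness is above 0 dB
val nLoud = data.filter($"loudness".cast("Float") > 0).count()
println(s"Rows with loudness > 0 dB: $nLoud")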

Speechiness

Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

In [29]:
Vegas("Speechiness")
    .withDataFrame(data.groupBy("speechiness").count())
    .encodeX("speechiness", Quant)
    .encodeY("count", Quant)
    .encodeColor(field="count", Quant, scale=Scale(rangeNominals=List("#EA98D2", "#659CCA")))
    .encodeSize(value=11L)
    .mark(Bar)
    .show

Valence

A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

In [30]:
Vegas("Valence")
    .withDataFrame(data.groupBy("valence").count())
    .encodeX("valence", Quant)
    .encodeY("count", Quant)
    .encodeColor(field="count", Quant, scale=Scale(rangeNominals=List("#EA98D2", "#659CCA")))
    .encodeSize(value=11L)
    .mark(Bar)
    .show

Tempo

The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

In [31]:
Vegas("Tempo")
    .withDataFrame(data.groupBy("tempo").count())
    .encodeX("tempo", Quant)
    .encodeY("count", Quant)
    .encodeColor(field="count", Quant, scale=Scale(rangeNominals=List("#EA98D2", "#659CCA")))
    .encodeSize(value=11L)
    .mark(Bar)
    .show

Data typing

To choose the type of each variable, we rely on the Spotify documentation.

In [32]:
val dfCasted: DataFrame = data
    .withColumn("duration_ms", $"duration_ms".cast("Int"))
    .withColumn("acousticness", $"acousticness".cast("Float"))
    .withColumn("danceability", $"danceability".cast("Float"))
    .withColumn("energy", $"energy".cast("Float"))
    .withColumn("instrumentalness", $"instrumentalness".cast("Float"))
    .withColumn("liveness", $"liveness".cast("Float"))
    .withColumn("loudness", $"loudness".cast("Float"))
    .withColumn("speechiness", $"speechiness".cast("Float"))
    .withColumn("valence", $"valence".cast("Float"))
    .withColumn("tempo", $"tempo".cast("Float"))
dfCasted = [genre: string, artist_name: string ... 16 more fields]
Out[32]:
[genre: string, artist_name: string ... 16 more fields]

Feature correlation study

In [33]:
import org.apache.spark.ml.feature.VectorAssembler
val colCorr= Array("acousticness", "danceability", "duration_ms", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence")
val assembler = new VectorAssembler()
    .setInputCols(colCorr)
    .setOutputCol("features")
colCorr = Array(acousticness, danceability, duration_ms, energy, instrumentalness, liveness, loudness, speechiness, tempo, valence)
assembler = vecAssembler_7299a10e8ef4
Out[33]:
vecAssembler_7299a10e8ef4
In [34]:
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

val Row(coeff1: Matrix) = Correlation.corr(assembler.transform(dfCasted), "features").head
val colNamePairs = colCorr.flatMap(name_from => colCorr.map(name_to => (name_from, name_to)))
val triplesList = colNamePairs.zip(coeff1.toArray)
  .filterNot{case((name_from, name_to), corr) => name_from >= name_to}
  .map{case((name_from, name_to), corr) => (name_from, name_to, corr)}
val corrDf = sc.parallelize(triplesList).toDF("name_from", "name_to", "corr")

corrDf.sort($"corr".desc).show(5)
corrDf.sort($"corr").show(5)
+------------+-----------+-------------------+
|   name_from|    name_to|               corr|
+------------+-----------+-------------------+
|      energy|   loudness| 0.8146067202887777|
|danceability|    valence|   0.54415832574648|
|    liveness|speechiness| 0.5112653589253985|
|danceability|   loudness|0.43403079178908577|
|      energy|    valence|0.43300746664482853|
+------------+-----------+-------------------+
only showing top 5 rows

+----------------+----------------+--------------------+
|       name_from|         name_to|                corr|
+----------------+----------------+--------------------+
|    acousticness|          energy| -0.7229341642500959|
|    acousticness|        loudness| -0.6879526714061998|
|instrumentalness|        loudness|  -0.510346072773586|
|          energy|instrumentalness| -0.3813198112973624|
|    danceability|instrumentalness|-0.36677073817119393|
+----------------+----------------+--------------------+
only showing top 5 rows

coeff1 = 
colNamePairs = Array((acousticness,acousticness), (acousticness,danceability), (acousticness,duration_ms), (acousticness...
1.0                   -0.3587350256039598   ... (10 total)
-0.3587350256039598   1.0                   ...
0.00891152428395838   -0.12412821897443163  ...
-0.7229341642500959   0.3197446219673942    ...
0.31821445468419374   -0.36677073817119393  ...
0.06901616607205185   -0.04189692407309058  ...
-0.6879526714061998   0.43403079178908577   ...
0.15413690123693932   0.13313415465148323   ...
-0.23692404267228767  0.018890904611838646  ...
-0.32119127814732445  0.54415832574648      ...
Out[34]:
Array((acousticness,acousticness), (acousticness,danceability), (acousticness,duration_ms), (acousticness...

Feature selection

In [35]:
dfCasted.printSchema
root
 |-- genre: string (nullable = true)
 |-- artist_name: string (nullable = true)
 |-- track_name: string (nullable = true)
 |-- track_id: string (nullable = true)
 |-- popularity: string (nullable = true)
 |-- acousticness: float (nullable = true)
 |-- danceability: float (nullable = true)
 |-- duration_ms: integer (nullable = true)
 |-- energy: float (nullable = true)
 |-- instrumentalness: float (nullable = true)
 |-- key: string (nullable = true)
 |-- liveness: float (nullable = true)
 |-- loudness: float (nullable = true)
 |-- mode: string (nullable = true)
 |-- speechiness: float (nullable = true)
 |-- tempo: float (nullable = true)
 |-- time_signature: string (nullable = true)
 |-- valence: float (nullable = true)

Since none of our variables are too strongly correlated (< 0.95), we could build a model with all of the variables above.

Nevertheless, duration_ms and popularity do not seem useful for a recommendation system based on track similarity, so we will not use them.
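For illustration, a minimal sketch of how these two columns could be dropped explicitly; in this notebook they are simply left out of the VectorAssembler input list later on:

// Hypothetical: drop the two columns judged irrelevant for similarity
val dfSelected = dfCasted.drop("duration_ms", "popularity")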

Data preparation

To use all the information at our disposal, we have to transform our data so that it can be consumed by a KNN algorithm.

This concerns in particular the variables genre, mode, key and time_signature, which are stored as strings.

Encoding our text variables with StringIndexer

The StringIndexer encodes our labels into indices. We have four features to transform: genre, key, mode and time_signature.

In [36]:
import org.apache.spark.ml.feature.StringIndexer
val genreIndexer = new StringIndexer()
    .setInputCol("genre")
    .setOutputCol("genreIndex")

val modeIndexer = new StringIndexer()
    .setInputCol("mode")
    .setOutputCol("modeIndex")

val keyIndexer = new StringIndexer()
    .setInputCol("key")
    .setOutputCol("keyIndex")

val tsIndexer = new StringIndexer()
    .setInputCol("time_signature")
    .setOutputCol("tsIndex")
genreIndexer = strIdx_570b80f3117e
modeIndexer = strIdx_835aecdd9ca1
keyIndexer = strIdx_fead7eea0095
tsIndexer = strIdx_a8c908af05d1
Out[36]:
strIdx_a8c908af05d1

Using a One-Hot Encoder

To keep our model from assuming a hierarchy between our categories, we pass them through a OneHotEncoder, which turns them into vectors of 0s and 1s.

In [37]:
import org.apache.spark.ml.feature.OneHotEncoder
val genreEncoder = new OneHotEncoder()
  .setInputCol("genreIndex")
  .setOutputCol("genreVec")

val modeEncoder = new OneHotEncoder()
  .setInputCol("modeIndex")
  .setOutputCol("modeVec")

val keyEncoder = new OneHotEncoder()
  .setInputCol("keyIndex")
  .setOutputCol("keyVec")

val tsEncoder = new OneHotEncoder()
  .setInputCol("tsIndex")
  .setOutputCol("tsVec")
genreEncoder = oneHot_572b19e0a259
modeEncoder = oneHot_c85f479e5a71
keyEncoder = oneHot_6a21b4c75cb2
tsEncoder = oneHot_f3daec6645ec
warning: there were four deprecation warnings; re-run with -deprecation for details
Out[37]:
oneHot_f3daec6645ec

Building our Pipeline

In [38]:
import org.apache.spark.ml.{Pipeline, PipelineModel}

val pipeline = new Pipeline()
    .setStages(Array(genreIndexer, modeIndexer, keyIndexer, tsIndexer,
                     genreEncoder, modeEncoder, keyEncoder, tsEncoder))

val dfEncode = pipeline.fit(dfCasted).transform(dfCasted)

dfEncode.select("genre","genreIndex", "genreVec").show(1)
+-----+----------+---------------+
|genre|genreIndex|       genreVec|
+-----+----------+---------------+
|Movie|      23.0|(26,[23],[1.0])|
+-----+----------+---------------+
only showing top 1 row

pipeline = pipeline_7317baebf3c8
dfEncode = [genre: string, artist_name: string ... 24 more fields]
Out[38]:
[genre: string, artist_name: string ... 24 more fields]

Aggregating our data

The duplicate problem still has to be solved.

For the values obtained through the One-Hot Encoder, we use the Summarizer class with its max method: since the vectors are composed of 0s and 1s, this collects the different genres, while the variables mode, key and time_signature are left unchanged (they did not vary across a track's rows).

For the other metrics, we choose to take the mean of the values across the duplicate rows, since they could vary slightly.

In [39]:
import org.apache.spark.ml.stat.Summarizer

val dfDistinct = dfEncode
    .groupBy("track_id", "artist_name", "track_name")
    .agg(mean("acousticness").alias("acousticness"), 
         mean("danceability").alias("danceability"), 
         mean("energy").alias("energy"), 
         mean("instrumentalness").alias("instrumentalness"), 
         mean("liveness").alias("liveness"), 
         mean("loudness").alias("loudness"), 
         mean("speechiness").alias("speechiness"), 
         mean("tempo").alias("tempo"), 
         mean("valence").alias("valence"),
         Summarizer.max($"modeVec").alias("modeVec"),
         Summarizer.max($"keyVec").alias("keyVec"), 
         Summarizer.max($"tsVec").alias("tsVec"),
         Summarizer.max($"genreVec").alias("genreVec"))
dfDistinct = [track_id: string, artist_name: string ... 14 more fields]
Out[39]:
[track_id: string, artist_name: string ... 14 more fields]

VectorAssembler

We use a VectorAssembler to gather all our features into a single column.

In [40]:
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
    .setInputCols(Array("acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence", "modeVec", "keyVec", "tsVec", "genreVec"))
    .setOutputCol("features")
assembler = vecAssembler_1d98a2e834fc
Out[40]:
vecAssembler_1d98a2e834fc

Normalizer

Since we are running a K-Nearest-Neighbors algorithm, we need to normalize our data; otherwise the distances we compute would be meaningless.

In [41]:
import org.apache.spark.ml.feature.Normalizer
val normalizer = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeatures")
normalizer = normalizer_04ba2c3df6c8
Out[41]:
normalizer_04ba2c3df6c8
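Note that Spark's Normalizer rescales each row vector to unit norm; it does not put the individual features on a common scale, so a feature with a large numeric range (such as tempo) can still dominate the distances. A per-feature alternative, sketched here with StandardScaler (not used in this notebook), would be:

import org.apache.spark.ml.feature.StandardScaler

// Hypothetical alternative: standardize every feature column-wise
// (zero mean, unit variance) instead of normalizing each row vector.
// Note: setWithMean(true) produces dense output, so mind memory on sparse data.
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithMean(true)
  .setWithStd(true)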
In [42]:
val assemblerPipeline = new Pipeline()
    .setStages(Array(assembler, normalizer))
assemblerPipeline = pipeline_1d81a2114aaa
Out[42]:
pipeline_1d81a2114aaa
In [43]:
val dfClean = assemblerPipeline.fit(dfDistinct).transform(dfDistinct)
dfClean = [track_id: string, artist_name: string ... 16 more fields]
Out[43]:
[track_id: string, artist_name: string ... 16 more fields]

Nearest Neighbors algorithm

I use an LSH algorithm, more precisely BucketedRandomProjectionLSH, to compute my buckets of nearest neighbors.

In [44]:
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH

val brp = new BucketedRandomProjectionLSH()
    .setBucketLength(5.0)
    .setNumHashTables(3)
    .setInputCol("features")
    .setOutputCol("hashes")
brp = brp-lsh_61e7b1b4d610
Out[44]:
brp-lsh_61e7b1b4d610
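Note that the LSH model reads the raw features column, so the normFeatures column computed above is not actually used by the search. A variant that hashes the normalized vectors instead (a sketch, not what the recorded outputs below were produced with) would simply point the input at that column:

// Hypothetical variant: hash the normalized vectors rather than the raw features
val brpNorm = new BucketedRandomProjectionLSH()
    .setBucketLength(5.0)
    .setNumHashTables(3)
    .setInputCol("normFeatures")
    .setOutputCol("hashes")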

Training the model

Our pipeline is complete; we now split the data to train and then challenge our model.
We just need to set aside the artists and tracks we want to test.
I personally choose the artist Tyler, The Creator.
We also decide that the recommendations should not return tracks by the same artist.

In [45]:
val tylerData = dfClean.filter($"artist_name" === "Tyler, The Creator")
val training = dfClean.filter($"artist_name" =!= "Tyler, The Creator")
tylerData = [track_id: string, artist_name: string ... 16 more fields]
training = [track_id: string, artist_name: string ... 16 more fields]
Out[45]:
[track_id: string, artist_name: string ... 16 more fields]
In [46]:
val model = brp.fit(training)
model = brp-lsh_61e7b1b4d610
Out[46]:
brp-lsh_61e7b1b4d610

Results

In [47]:
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
val key = tylerData.filter($"track_name" === "See You Again").select("features").rdd.map { case Row(v: Vector) => v}.first
key = (51,[0,1,2,3,4,5,6,7,8,9,19,21,29,33,36],[0.3709999918937683,0.5580000281333923,0.5590000152587891,7.48999991628807E-6,0.10899999737739563,-9.222000122070312,0.09589999914169312,78.55799865722656,0.6200000047683716,1.0,1.0,1.0,1.0,1.0,1.0])
Out[47]:
(51,[0,1,2,3,4,5,6,7,8,9,19,21,29,33,36],[0.3709999918937683,0.5580000281333923,0.5590000152587891,7.48999991628807E-6,0.10899999737739563,-9.222000122070312,0.09589999914169312,78.55799865722656,0.6200000047683716,1.0,1.0,1.0,1.0,1.0,1.0])
In [48]:
val resultTyler = model.approxNearestNeighbors(training, key, 4)
resultTyler = [track_id: string, artist_name: string ... 18 more fields]
Out[48]:
[track_id: string, artist_name: string ... 18 more fields]
In [49]:
%%dataframe
resultTyler.select("artist_name", "track_name")
Out[49]:
artist_name | track_name
Chance the Rapper | How Great (feat. Jay Electronica & My cousin Nicole)
Drake | 6 Man
Flatbush Zombies | Facts (feat. Jadakiss)
J. Cole | everybody dies

Analyzing the results

In [50]:
%%dataframe
tylerData.filter($"track_name" === "See You Again").select("track_id","artist_name", "track_name", "acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "valence")
Out[50]:
track_id | artist_name | track_name | acousticness | danceability | energy | instrumentalness | liveness | loudness | valence
7KA4W4McWYRpgf0fWsJZWB | Tyler, The Creator | See You Again | 0.3709999918937683 | 0.5580000281333923 | 0.5590000152587891 | 7.48999991628807E-6 | 0.10899999737739563 | -9.222000122070312 | 0.6200000047683716
In [51]:
resultTyler
    .select("track_id", "artist_name", "track_name", "acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "valence").show(false)
+----------------------+-----------------+----------------------------------------------------+--------------------+------------------+------------------+--------------------+-------------------+------------------+-------------------+
|track_id              |artist_name      |track_name                                          |acousticness        |danceability      |energy            |instrumentalness    |liveness           |loudness          |valence            |
+----------------------+-----------------+----------------------------------------------------+--------------------+------------------+------------------+--------------------+-------------------+------------------+-------------------+
|0OT0cCKbSmSMRvyWeqEFBq|Chance the Rapper|How Great (feat. Jay Electronica & My cousin Nicole)|0.47699999809265137 |0.4390000104904175|0.4830000102519989|0.0                 |0.5529999732971191 |-9.449000358581543|0.2919999957084656 |
|4kdfjhj9xNkYU0R8xlDy8k|Drake            |6 Man                                               |0.22699999809265137 |0.7979999780654907|0.5289999842643738|0.0                 |0.11400000005960464|-9.288000106811523|0.36800000071525574|
|1PNfhBdmFikFn4vkrwiq05|Flatbush Zombies |Facts (feat. Jadakiss)                              |0.029200000688433647|0.6890000104904175|0.5379999876022339|2.479999966453761E-5|0.1809999942779541 |-9.097999572753906|0.3160000145435333 |
|1wIQtB3UQ1TfjNMZZqO6eh|J. Cole          |everybody dies                                      |0.30000001192092896 |0.6169999837875366|0.6859999895095825|1.359999987471383E-5|0.10499999672174454|-9.857999801635742|0.5440000295639038 |
+----------------------+-----------------+----------------------------------------------------+--------------------+------------------+------------------+--------------------+-------------------+------------------+-------------------+

In [52]:
df.filter($"track_name" === "See You Again" && $"artist_name" === "Tyler, The Creator").select("genre").show()
+-------+
|  genre|
+-------+
|Hip-Hop|
|    Rap|
|    Pop|
+-------+

In [53]:
df.filter($"track_id" === "0OT0cCKbSmSMRvyWeqEFBq" || $"track_id" === "4kdfjhj9xNkYU0R8xlDy8k" || $"track_id" === "1PNfhBdmFikFn4vkrwiq05" || $"track_id" === "1wIQtB3UQ1TfjNMZZqO6eh")
          .select("genre", "track_name", "artist_name").show()
+-------+--------------------+-----------------+
|  genre|          track_name|      artist_name|
+-------+--------------------+-----------------+
|Hip-Hop|      everybody dies|          J. Cole|
|Hip-Hop|               6 Man|            Drake|
|Hip-Hop|How Great (feat. ...|Chance the Rapper|
|Hip-Hop|Facts (feat. Jada...| Flatbush Zombies|
|    Pop|               6 Man|            Drake|
|    Pop|      everybody dies|          J. Cole|
|    Rap|      everybody dies|          J. Cole|
|    Rap|               6 Man|            Drake|
|    Rap|How Great (feat. ...|Chance the Rapper|
|    Rap|Facts (feat. Jada...| Flatbush Zombies|
|    Pop|How Great (feat. ...|Chance the Rapper|
+-------+--------------------+-----------------+

The algorithm seems to have worked, given the tracks suggested by our model: all four recommendations share the Hip-Hop and Rap genres with "See You Again".
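To make that check programmatic, one rough proxy (a sketch, under the assumption that sharing a genre indicates similarity) is to count how many recommendations share at least one genre with the query track:

// Genres of the query track ("See You Again")
val queryGenres = df.filter($"track_id" === "7KA4W4McWYRpgf0fWsJZWB")
    .select("genre").collect().map(_.getString(0)).toSet

// How many of the four recommended tracks share one of those genres?
val overlap = df
    .filter($"track_id".isin("0OT0cCKbSmSMRvyWeqEFBq", "4kdfjhj9xNkYU0R8xlDy8k",
                             "1PNfhBdmFikFn4vkrwiq05", "1wIQtB3UQ1TfjNMZZqO6eh"))
    .filter($"genre".isin(queryGenres.toSeq: _*))
    .select("track_id").distinct().count()
println(s"$overlap of 4 recommendations share a genre with the query track")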

Using the trained model

I have no metric that lets me judge the quality of my analysis, so I provide a function that lets any user of the notebook easily look up recommendations for a given track.

In [54]:
import org.apache.spark.sql.Dataset
def recommend(track_id: String, brp: BucketedRandomProjectionLSH, dfClean: DataFrame): Dataset[_] = {
    // Exclude the requested track from the search space, then fit the LSH model
    val training = dfClean.filter($"track_id" =!= track_id)
    val model = brp.fit(training)
    // Extract the feature vector of the requested track and query its neighbors
    val key = dfClean.filter($"track_id" === track_id)
        .select("features").rdd.map { case Row(v: Vector) => v }.first
    model.approxNearestNeighbors(training, key, 4)
}
recommend: (track_id: String, brp: org.apache.spark.ml.feature.BucketedRandomProjectionLSH, dfClean: org.apache.spark.sql.DataFrame)org.apache.spark.sql.Dataset[_]
In [55]:
val result =  recommend("7KA4W4McWYRpgf0fWsJZWB", brp, dfClean)
result = [track_id: string, artist_name: string ... 18 more fields]
Out[55]:
[track_id: string, artist_name: string ... 18 more fields]
In [56]:
%%dataframe
result.select("artist_name", "track_name")
Out[56]:
artist_name | track_name
Chance the Rapper | How Great (feat. Jay Electronica & My cousin Nicole)
Drake | 6 Man
Flatbush Zombies | Facts (feat. Jadakiss)
J. Cole | everybody dies